Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cull idle kernels #2215

Merged
merged 4 commits into from Apr 5, 2017
Merged

Cull idle kernels #2215

merged 4 commits into from Apr 5, 2017

Conversation

kevin-bates
Copy link
Member

@kevin-bates kevin-bates commented Feb 22, 2017

(Updated to include review recommendations below...)

This change introduces the ability to cull kernels that have been idle for a specified amount of time. Off by default, it is enabled by setting --MappingKernelManager.cull_idle_timeout to a positive value representing the number of seconds a kernel must remain inactive to be culled. This feature leverages activity tracking's last_activity value to determine when the last activity for a given kernel occurred. Typical values for cull_idle_timeout will be on the order of 43200 (12 hours) or 86400 (24 hours) seconds as described in this JKG issue. These changes originally targeted Jupyter Kernel Gateway per issue #226 but really belongs in the Notebook server.

An optional property has been introduced to adjust the interval in which idle kernels are culled provided they have exceeded the aforementioned timeout property. This property is adjusted by setting --MappingKernelManager.cull_interval to a positive value. If not set or non-positive, the default value of 300 (5 minutes) seconds will be used.

JKG Issue 227 has been opened to remove the activity tracking that was also in place in the JKG. Prior to that, there could be false reports of active kernels that have been culled by the underlying MappingKernelManager. If not removed, we should minimally adjust the JKG's activity tracking handler to consult the active kernel manager class on each request.

@@ -52,6 +58,26 @@ def _update_root_dir(self, proposal):
raise TraitError("kernel root dir %r is not a directory" % value)
return value

cull_kernels_after_minutes_env = 'CULL_KERNELS_AFTER_MINUTES'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing a way to set traitlets with environment variables is done in the kernel gateway code base, but it's not common in the notebook codebase.

Copy link
Member

@minrk minrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I left a few comments inline. Let me know if you'd like any help finishing it up, or clarifying things.

You can push new commits to this same branch and it will update the PR.

@@ -52,6 +58,26 @@ def _update_root_dir(self, proposal):
raise TraitError("kernel root dir %r is not a directory" % value)
return value

cull_kernels_after_minutes_env = 'CULL_KERNELS_AFTER_MINUTES'
cull_kernels_after_minutes_default = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's simplify the trait declarations for now. @parente is right that the environment variable declarations aren't generally used here, so we can have a single assignment:

cull_idle_timeout = Integer(0, config=True,
    help="...
)

without the @default generator or other variables defined.

Plus, I think it might be clearer to use seconds everywhere, rather than a _minutes timeout here and a _seconds timeout below. How about we call them:

cull_idle_timeout: timeout (in seconds) after which a kernel is considered idle and ready to be culled.
cull_interval: interval (in seconds) on which to check for idle kernels to cleanup...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - this has been addressed in the subsequent commit.

for kId, kernel in self._kernels.items():
self.cull_kernel(kId, kernel)

def cull_kernel(self, kId, kernel):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cull_kernel can just take a kernel_id argument, and the above can be for kernel_id in list(self._kernels). The list will make a copy of the keys, which may be needed, as shutting down a kernel will remove it from the _kernels dictionary and you can't iterate over a dictionary while changing it, for example:

for kernel_id in list(self._kernels):
    self.cull_kernel_if_idle(kernel_id)

def cull_kernel_if_idle(self, kernel_id):
    kernel = self._kernels[kernel_id]
    ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - thanks.

if self._culler_callback is None:
loop = IOLoop.current()
if self.kernel_culling_interval_seconds <= 0: #handle case where user set invalid value
self.log.warn("Invalid value for 'kernel_culling_interval_seconds' detected (%s) - using default value (%s).",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.warn is deprecated in Python in favor of log.warning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

if activity is not None:
dtNow = utcnow()
#dtActivity = datetime.strptime(activity,'%Y-%m-%dT%H:%M:%S.%f')
dtIdle = dtNow - activity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, we try to use _ to separate words, such as kernel_id, idle_time instead of camelCase

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had noticed the convention previously but am finding this to be a very difficult habit to break - thanks for pointing it out.

Copy link
Member

@Carreau Carreau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think many people will be happy to see this ! It deserved a good note in what's new.

A couple of suggestion, but it's look ok as is for 5.1

if self.cull_interval <= 0: #handle case where user set invalid value
self.log.warning("Invalid value for 'cull_interval' detected (%s) - using default value (%s).",
self.cull_interval, self.cull_interval_default)
self.cull_interval = self.cull_interval_default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also warn for cull_interval < 60s ? it's technically not wrong, but it's likely that user might have thoughs it was minutes. Even anything below 300s seem it can be a mistake.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. I thought implementing a minimum timeout would be good (now that I've tested it :-)). 300 strikes me as good minimum. If folks are making changes in these areas and need shorter timeouts for testing, they can temporarily adjust the constant.

Positive values less than the minimum would result in a warning and auto-adjustment to the minimum (similar to what happens if the interval is <= 0 and culling is enabled).

So, to your comment above, the description would be...

"Timeout (in seconds) after which a kernel is considered idle and ready to be culled. Values of 0 or lower disable culling. The minimum timeout is 300 seconds. Positive values less than the minimum will be adjusted to the minimum."

(or is that too verbose?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Carreau a warning is a good idea.

@kevin-bates that sounds perfect. You might add (5 minutes) next to 300 seconds.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@@ -52,6 +58,15 @@ def _update_root_dir(self, proposal):
raise TraitError("kernel root dir %r is not a directory" % value)
return value

cull_idle_timeout = Integer(0, config=True,
help="""Timeout (in seconds) after which a kernel is considered idle and ready to be culled."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking, I think a "Values of 0 or lower will disable culling" could be nice. But we can improve later in a subsequent PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

def cull_kernels(self):
self.log.debug("Polling every %s seconds for kernels idle > %s seconds...",
self.cull_interval, self.cull_idle_timeout)
for kernel_id in list(self._kernels):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the list() here is to reify the things you want to try to cull and make sure you don't modify the list while culling. Is that right ? If so it seem like something easily broken.

What do @minrk and @parente think about building the list of kernel to shutdown and shut them al down at once ?
That should make the flow a little bit easier to follow (and also to change the predicate if necessary).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way is fine. I don't think getting a list of dictionary keys is easily broken.

Would this be clearer?

# identify the idle kernels to be culled
to_cull = [ kid for kid in self._kernels if self.should_cull(kid) ]
# cull them
for kernel_id in to_cull:
    self.shutdown_kernel(kernel_id)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do @minrk and @parente think about building the list of kernel to shutdown and shut them al down at once ?

Sounds good to me as long as there's appropriate checks in place so that an problem getting the status of any one kernel (if that's possible) doesn't prevent cleanup of others that should be culled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm not really following the benefit of obtaining the list of cullable kernels, then walking that list to perform the cull when we could just as easily encounter an issue determining the cullable kernels. So, to address the point raised by @parente, what about something like the following? (Is that the right way to handle an unknown set of exceptions in python?)

        for kernel_id in list(self._kernels):
            try:
                self.cull_kernel_if_idle(kernel_id)
            except:
                self.log.error("The following error was encountered while checking the idle duration of kernel %s: %s",
                    kernel_id, sys.exc_info()[0])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works, too. There are plenty of valid ways to do it. A few notes:

  • a comment about copying the kernel id list so we don't modify the dict during iteration might help clarity
  • you can use log.exception to automatically log a traceback, so you don't need to use exc_info.
  • always use at least except Exception:, because except: catches things like KeyboardInterrupts that shouldn't be caught.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @minrk - very helpful. Here's what I've got...

        """Create a separate list of kernels to avoid conflicting updates while iterating"""
        for kernel_id in list(self._kernels):
            try:
                self.cull_kernel_if_idle(kernel_id)
            except Exception as e:
                self.log.exception("The following exception was encountered while checking the idle duration of kernel %s: %s",
                    kernel_id, e)

@minrk
Copy link
Member

minrk commented Feb 26, 2017

@kevin-bates excellent, thanks! We're in the process of finishing up the 5.0 release from the master branch, but this can be merged once that release has been pushed out.

@minrk
Copy link
Member

minrk commented Apr 5, 2017

5.0 is out, long live 5.1!

@minrk minrk merged commit 8d6460a into jupyter:master Apr 5, 2017
@kevin-bates kevin-bates deleted the cull-idle-kernels branch April 5, 2017 14:38
@parente
Copy link
Member

parente commented May 17, 2017

After finally getting a test bed setup for this change, I realize that it does not take into account whether the kernel is busy or not, or whether there are connections to the kernel or not. I'm going to submit a follow-on PR that adds settings for whether these attributes should affect the culling decision or not.

@ibigquant
Copy link

when will 5.1 go release?

@kevin-bates
Copy link
Member Author

@BigQuant - Sounds like its very soon per the latest Jupyter Weekly Summary (2017, week 29): https://groups.google.com/forum/#!topic/jupyter/NgR4XpUf66E

Notebook (Grant)

    Grant is closing out outstanding issues marked with 5.1 milestone
    Planning on releasing a 5.1 release candidate by EOW

@takluyver
Copy link
Member

Yup, we're hoping to put the release candidate out any day now.

@birdstar
Copy link

This is exactly the feature I wanted!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants